ABSTRACT

The recent SARS epidemic has boosted interest
in the discovery of novel human and animal
coronaviruses. By July 2007, more than 3000 coronavirus
sequence records, including 264 complete genomes,
were available in GenBank. The number of
coronavirus species with complete genomes available
increased from 9 in 2003 to 25 in 2007,
of which six, including coronavirus HKU1, bat
SARS coronavirus, group 1 bat coronavirus HKU2 and
groups 2c and 2d coronaviruses, were sequenced
by our laboratory. To overcome the problems we
encountered in the existing databases during
comparative sequence analysis, we built a comprehensive
database, CoVDB (http://covdb.microbiology.hku.hk),
of annotated coronavirus genes and genomes.
CoVDB provides a convenient platform for
rapid and accurate batch sequence retrieval, the
cornerstone and bottleneck of comparative gene
or genome analysis. Sequences can be directly
downloaded from the website in FASTA format.
CoVDB also provides detailed annotation of all
coronavirus sequences using a standardized
nomenclature system, and overcomes the problems
of duplicated and identical sequences in other
databases. For complete genomes, a single
representative sequence for each species is available for
comparative analysis such as phylogenetic studies.
With the annotated sequences in CoVDB, more
specific BLAST search results can be generated for
efficient downstream analysis.


INTRODUCTION 

Coronaviruses are found in a wide variety of animals and
are associated with respiratory, enteric, hepatic and
neurological diseases of varying severity. Based on
genotypic and serological characterization, coronaviruses
were divided into three distinct groups (1-3). As a result of
the unique mechanism of viral replication, coronaviruses
have a high frequency of recombination (2,4).

The recent severe acute respiratory syndrome (SARS)
epidemic, the discovery of SARS coronavirus (SARS-CoV)
and the identification of SARS-CoV-like viruses from
Himalayan palm civets and a raccoon dog from wild live
markets in China have boosted interest in the
discovery of novel coronaviruses in both humans and
animals (5-9) (Figure 1). For human coronaviruses, a
novel group 1 human coronavirus, human coronavirus
NL63 (HCoV-NL63), was reported in 2004 (10,11), while
we described the discovery, complete genome sequence and
genetic diversity of a novel group 2 human coronavirus,
coronavirus HKU1 (CoV-HKU1), in 2005 (4,12-14).
As for animal coronaviruses, six group 1 (15-17), four
group 2, including bat SARS-CoV and two new subgroups
of group 2 coronaviruses (6,8,18,19), and 11 group 3
(20-23) coronaviruses have recently been described.

By July 2007, more than 3000 coronavirus sequence
records, including a total of 264 complete genomes, were
available in GenBank (24). Among the 25 coronavirus
species with complete genome sequences available, six were
sequenced by our group, including CoV-HKU1 and bat
SARS-CoV (13,16,18,19). Furthermore, we defined two
novel subgroups of group 2 coronavirus (18). During the
process of batch sequence retrieval for comparative
genome analysis of the coronavirus genomes that we
sequenced, we encountered several major problems with
the coronavirus sequences in GenBank as well as in other
coronavirus databases (Coronaviridae Bioinformatics
Resource,
http://athena.bioc.uvic.ca/database.php?db=coronaviridae;
PATRIC, http://patric.vbi.vt.edu) (25).
First, in GenBank, the non-structural proteins in the
polyprotein encoded by orf1ab were not annotated.
Second, in all databases, for the non-structural proteins
encoded by ORFs downstream of orf1ab, the annotations
are often confusing because they are not annotated using
a standardized system. Third, multiple accession numbers
are often present for reference sequences (26). These
problems often lead to confusion when sequence retrieval
is performed. Fourth, coronaviruses, especially SARS-CoV,
amplified from different specimens may contain
the same genome or gene sequences. These sequences
usually lead to redundant work when they are analyzed.

Figure 1. Number of coronavirus sequences in GenBank from 1984 to
2006.

In view of these problems, we started to develop our 
own database for coronavirus gene and genome sequences 
in 2005. In this database, CoVDB, we sought to create 
a user-friendly platform for efficient batch sequence 
retrieval, which is crucial for comparative genome 
analysis. In this article, we describe this comprehensive 
database of annotated coronavirus genes and genomes, 
which provides a central source of information about 
coronaviruses. To further increase the usefulness of 
CoVDB, commonly used bioinformatics tools were also 
included for analysis of the sequence data. 

MATERIALS AND METHODS 

Database description 

Sequence data. CoVDB is a web-based coronavirus
database. Its data are stored and managed with the
MySQL database management system. As of July 2007,
CoVDB contained 3982 coronavirus sequences and one
torovirus genome sequence. Two hundred and sixty-four
of them are complete genomes and the rest are partial
genomes or genes. All data were retrieved from GenBank
using BioPerl modules. We annotated sequences
that lacked gene information or non-structural protein
boundaries and labeled the 5' and 3' untranslated regions
(UTRs) of the genomes. As of July 2007, CoVDB contained
12 344 genes and UTRs.

Information on coronavirus genome characteristics. In
addition to the two sequence retrieval pages, CoVDB
collects information on coronavirus sequence
characteristics, including genome organization, a brief
description of each complete coronavirus genome, GC content,
polyprotein cleavage sites, transcription regulatory
sequences, acidic tandem repeat sequences and known
RNA structures. These pieces of information can be
accessed by clicking ‘Genome’ in the top menu bar of
CoVDB. On the ‘Tools’ page, a BLAST similarity search (27)
against annotated coronavirus sequences in CoVDB can
be performed, and other commonly used tools are also
provided.
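
As an illustration of the GC content statistic mentioned above (and of the GCskew column shown on the CoVDB result pages), the following minimal Python sketch computes both values for a nucleotide string. The formula used for GC skew, (G - C)/(G + C), is the commonly used definition and is an assumption here; the text does not state how CoVDB computes it. The function names and the toy sequence are hypothetical.

```python
def gc_content(seq):
    """Fraction of G and C among the unambiguous bases of seq."""
    seq = seq.upper()
    g, c = seq.count("G"), seq.count("C")
    total = g + c + seq.count("A") + seq.count("T") + seq.count("U")
    return (g + c) / total if total else 0.0


def gc_skew(seq):
    """GC skew, assumed here to be (G - C) / (G + C)."""
    seq = seq.upper()
    g, c = seq.count("G"), seq.count("C")
    return (g - c) / (g + c) if (g + c) else 0.0


# Toy sequence, not taken from any real genome.
toy = "GGGCATGCAT"
print(round(gc_content(toy), 3), round(gc_skew(toy), 3))
```

Ambiguous bases such as N are ignored in the denominator of the GC content, which is one of several reasonable conventions.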


Functionality of the database 

Batch sequence retrieval. The main goal in setting
up CoVDB is to provide a convenient and efficient
platform for retrieving batches of coronavirus gene
sequences. The interfaces of the database are simple and
user friendly. All genes and genomes contain links to
GenBank and/or PubMed. CoVDB contains two main
pages for sequence retrieval. From the homepage, one
can enter the first main page, for retrieval of complete
genomes and their genes, by clicking ‘CoVDB’ (Figure 2a).
From this page, users can obtain genes from specific
coronavirus species by selecting the corresponding
check boxes. We defined one representative genome
from each species as the ‘Type strain’. In most cases,
this ‘Type strain’ is the one assigned as the reference
sequence in GenBank. By choosing the ‘Type strain only’
option, users can obtain one gene sequence per species
and construct a phylogenetic tree or perform other
comparisons. An example of retrieving the complete genome,
or a specific gene of the complete genome, of selected
species is shown in Figure 2b and c.
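
The effect of the ‘Type strain only’ option can be sketched as a simple filter over genome records. The record structure, the function name and the placeholder accession "HYPOTHETICAL_1" below are assumptions made for illustration only; they do not describe the actual CoVDB schema.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GenomeRecord:
    accession: str
    species: str          # short name used on the retrieval page
    is_type_strain: bool  # one representative genome per species


def select_genomes(records, species, type_strain_only=True):
    """Return records of the chosen species, optionally type strains only."""
    wanted = set(species)
    return [
        r for r in records
        if r.species in wanted and (r.is_type_strain or not type_strain_only)
    ]


records = [
    GenomeRecord("NC_004718", "SARS-human", True),
    GenomeRecord("HYPOTHETICAL_1", "SARS-human", False),  # placeholder entry
    GenomeRecord("NC_005147", "HCoV-OC43", True),
    GenomeRecord("NC_003045", "BCoV", True),
]

picked = select_genomes(records, ["SARS-human", "HCoV-OC43"])
print([r.accession for r in picked])
```

With `type_strain_only=False`, the same call would also return the non-representative SARS-human record.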

From the page for retrieval of complete genomes and
their genes, one can enter the second main page, for
retrieval of all complete and/or incomplete genes of a
coronavirus (Figure 3a), by clicking ‘From all groups of
genes’. On this page, all the gene sequences are grouped
vertically according to the coronavirus group and
subgroup they belong to, and horizontally by the names
of the genes. The option ‘Exclude partial CDS’ can be
used if only complete genes are required. An example of
retrieving all the sequences of a particular gene for a group
of coronaviruses is shown in Figure 3b. If the translated
sequence of a selected gene has more than one stop
codon, probably owing to a sequencing error, the
number in the ‘Length’ column of this gene is marked
in red.
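
The stop-codon check described above can be sketched as follows: translate the coding sequence with the standard genetic code and flag it when more than one stop codon appears. This is a minimal sketch under the assumption that the sequence is given in frame from its first codon; the function names are hypothetical and the CoVDB implementation may differ.

```python
from itertools import product

# Standard genetic code, codons enumerated in TCAG order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): a for c, a in zip(product(BASES, repeat=3), AMINO)}


def translate(cds):
    """Translate an in-frame nucleotide sequence; unknown codons become X."""
    seq = cds.upper().replace("U", "T")
    return "".join(
        CODON_TABLE.get(seq[i:i + 3], "X") for i in range(0, len(seq) - 2, 3)
    )


def has_extra_stops(cds):
    """True if the translation contains more than one stop codon."""
    return translate(cds).count("*") > 1


print(translate("ATGAAATAA"), has_extra_stops("ATGAAATAA"))
print(translate("ATGTAAAAATGA"), has_extra_stops("ATGTAAAAATGA"))
```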

Polyprotein annotation. In all coronavirus genomes,
orf1ab occupies two-thirds of the genome and is
translated as a polyprotein. This polyprotein is
post-translationally cleaved by 3C-like protease (3CLpro) and
papain-like protease (PLpro) into 15-16 non-structural
proteins. Some of the non-structural proteins, such as
RNA-dependent RNA polymerase, helicase, 3CLpro and
PLpro, are essential for replication or virulence of the
coronavirus, although the functions of others are still
unclear. Because of the importance of the non-structural
proteins, these sequences are often used for evolutionary
analysis, primer design, etc. However, except for the
reference sequences, detailed cleavage site information is
not provided for the non-structural proteins in other
sequences in GenBank. Since it has been shown that
3CLpro and PLpro of coronaviruses cleave at conserved
specific amino acids, the putative cleavage sites of
the 15-16 non-structural proteins can be predicted by
multiple sequence alignment. Using this
information, we have annotated these non-structural
proteins in all the coronavirus sequences for easy retrieval
in CoVDB.
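
One way to picture alignment-based prediction of cleavage sites is to carry the known site positions of an annotated reference polyprotein over to a query sequence through a gapped alignment. The sketch below assumes that the alignment has already been computed by an external tool and is supplied as two equal-length gapped strings; the function name and the toy sequences are hypothetical and are not real coronavirus sequences.

```python
def transfer_cleavage_sites(ref_aln, qry_aln, ref_sites):
    """Map reference cleavage sites onto the query via a pairwise alignment.

    ref_sites holds 1-based positions, in the ungapped reference, of the
    residue immediately before each cleavage. The result maps each of them
    to the 1-based position in the ungapped query, or None when the query
    has a gap at that alignment column.
    """
    if len(ref_aln) != len(qry_aln):
        raise ValueError("aligned sequences must have equal length")
    wanted = set(ref_sites)
    mapped = {}
    ref_pos = qry_pos = 0
    for r, q in zip(ref_aln, qry_aln):
        if r != "-":
            ref_pos += 1
        if q != "-":
            qry_pos += 1
        if r != "-" and ref_pos in wanted:
            mapped[ref_pos] = qry_pos if q != "-" else None
    return mapped


# Toy alignment; cleavage is assumed after reference residues 4 and 10.
ref_aln = "MKLQSG-AVLQAG"
qry_aln = "MK-QSGTAILQSG"
sites = transfer_cleavage_sites(ref_aln, qry_aln, [4, 10])
print(sites)
```

Predicted positions obtained this way would still need to be checked against the conserved residues expected at each site.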
Protein/gene name unification. By convention, all
non-structural proteins in the polyprotein encoded by orf1ab
are named ‘nsp’, with each protein numbered
consecutively starting from the 5' end (nsp1-nsp16).
The structural proteins after the polyprotein are
hemagglutinin esterase (HE, in group 2a coronaviruses), spike
glycoprotein (S), envelope protein (E), membrane protein
(M) and nucleocapsid protein (N). However, there is
no unified naming system for the non-structural proteins
encoded by ORFs downstream of orf1ab. This lack of
a unified system greatly reduces the reliability and accuracy
of ortholog retrieval.

(a) [Screenshot of the ‘Sequence retrieve’ page: top menu bar (Genome | Tools | Publications | Contact | Links), the option ‘Get genes from 264 completed genomes only’, the link ‘From all groups of genes’, and species check boxes listed by group, e.g. Group 1: HCoV-229E (1), Bat-CoV HKU2 (4); Group 2a: HCoV-OC43, sable antelope (1); SARS-CoV: Human (131).]

(b) Result page for selected genomes (8 records):

NCBIacc    Shortname   Length  Strain/isolate  Country:Region   Organism                                          Group  PMID      GC     GCskew
NC_005147  HCoV-OC43   30738   OC43            Belgium          Human coronavirus OC43                            G2a    15650185  0.368  0.176
NC_006577  CoV-HKU1A   29926   N1              China:Hong Kong  Human coronavirus HKU1A                           G2a    15613317  0.320  0.188
NC_003045  BCoV        31028   BCoV-ENT        USA              Bovine coronavirus                                G2a    11714968  0.371  0.174
NC_001846  MHV         31357   MHV-A59         USA              Murine hepatitis virus                            G2a    9426441   0.417  0.142
NC_007732  PHEV        30480   VW572           Belgium          Porcine hemagglutinating encephalomyelitis virus  G2a    16809333  0.372  0.164
NC_004718  SARS-human  29751   Tor2            Canada:Toronto   Human SARS coronavirus                            G2b    15020242  0.407  0.020
AY304488   SARS-civet  29731   SZ16            China:Shenzhen   Civet SARS coronavirus                            G2b    12958366  0.408  0.020
DQ022305   SARS-bat    29728   HKU3-1          China:Hong Kong  Bat SARS coronavirus                              G2b    16169905  0.411  0.027

(c) Result page for the S gene of the same genomes (8 records; the protein id of the SARS-civet entry is not legible in the source):

ntacc      Shortname   Tag  Gene  From   To     Shift  Length  Protein_id  Strain/Isolate  Group  Country:Region
NC_005147  HCoV-OC43   CDS  S     23644  27729  0      4086    NP_937950   OC43            G2a    Belgium
NC_006577  CoV-HKU1A   CDS  S     22942  27012  0      4071    YP_173238   N1              G2a    China:Hong Kong
NC_003045  BCoV        CDS  S     23641  27732  0      4092    NP_150077   BCoV-ENT        G2a    USA
NC_001846  MHV         CDS  S     23929  27903  0      3975    NP_045300   MHV-A59         G2a    USA
NC_007732  PHEV        CDS  S     23427  27476  0      4050    YP_459952   VW572           G2a    Belgium
NC_004718  SARS-human  CDS  S     21492  25259  0      3768    NP_828851   Tor2            G2b    Canada:Toronto
AY304488   SARS-civet  CDS  S     21477  25244  0      3768                SZ16            G2b    China:Shenzhen
DQ022305   SARS-bat    CDS  S     21471  25199  0      3729    AAY83866    HKU3-1          G2b    China:Hong Kong

Figure 2. Screenshots of CoVDB complete genome retrieval pages. (a) A specific gene can be retrieved using the pull-down list at the lower left corner.
The number in brackets indicates the number of complete genomes for that coronavirus. (b) Example showing genomes of selected species (some
group 2a coronaviruses and SARS-CoV-related coronaviruses). The default is to show only the ‘Type strain’ for each species. The columns NCBIacc and
PMID link to GenBank and PubMed, respectively. (c) Example showing the S gene of selected species, obtained by choosing S in the pull-down list. For genes
downstream of orf1ab, sequences upstream of the initiation codons can also be retrieved from this result page. This function is particularly useful for
the detection of transcription regulatory sequences.

In CoVDB, with the aim of facilitating gene retrieval,
we tried to unify the naming of these non-structural
proteins from different groups of coronaviruses. At the
same time, we have tried to avoid radical changes
in the names that may lead to confusion. In CoVDB,
these non-structural proteins are named NS2a, NS3x,
NS4x, NS5x and NS7x (x = a, b, c, ...). NS2a denotes the
ORF between orf1ab and HE of group 2a coronaviruses.
NS3x denotes the ORFs between S and E of groups 1, 2c,
2d and 3 coronaviruses. In most of these coronaviruses,
there are two NS3x, named NS3a and NS3b. However,
in group 1 coronaviruses, the genomes of some members
(e.g. HCoV-NL63, PEDV) contain only one ORF between
S and E. When we compared their putative amino acid
sequences with the corresponding ones in other group 1
coronavirus genomes using BLAST, and searched
for conserved domains using motifscan, the results showed
that the putative proteins encoded by these ORFs
belonged to a protein family in Pfam originally assigned
as ‘Corona_NS3b’ (accession number PF03053).
Therefore, we named these ORFs NS3b. NS4x denotes
the ORFs between S and E of group 2a coronaviruses.
NS5x denotes the ORFs between M and N of group 3
coronaviruses. One exception is NS5a of group 2a
coronaviruses. Traditionally, this name denotes an ORF
upstream of E in group 2a coronaviruses. Therefore,
we have kept this name for that ORF in CoVDB. NS7x
denotes the ORFs downstream of the N gene. It is important
to note that, owing to variations in genome organization
among different groups of coronaviruses (Table 1),
NS genes with the same name in different coronavirus
groups may not be orthologs of each other. The complete
genome gene search page of CoVDB contains a link to a
Gene synonyms page, which lists synonymous
names of the various genes in the coronavirus genomes.
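
The positional part of the naming rules described above can be summarized as a lookup from coronavirus group and flanking genes to the name family. The sketch below is only a paraphrase of the rules in the text: the table layout and function name are assumptions, group 2b is deliberately left out because the text does not define NS names for it, and the final letter (a, b, c, ...) is not assigned because, as the NS3b example shows, it can depend on homology rather than position alone.

```python
# (group, upstream gene, downstream gene) -> candidate name families.
NS_FAMILIES = {
    ("2a", "orf1ab", "HE"): ["NS2a"],
    # In group 2a the ORF immediately upstream of E keeps its
    # traditional name NS5a; other ORFs between S and E are NS4x.
    ("2a", "S", "E"): ["NS4x", "NS5a"],
    ("3", "M", "N"): ["NS5x"],
}
for group in ("1", "2c", "2d", "3"):
    NS_FAMILIES[(group, "S", "E")] = ["NS3x"]


def ns_families(group, upstream, downstream):
    """Return candidate name families for an ORF between two genes."""
    if upstream == "N":
        return ["NS7x"]  # ORFs downstream of the N gene
    return NS_FAMILIES.get((group, upstream, downstream), [])


print(ns_families("1", "S", "E"))
print(ns_families("2a", "S", "E"))
print(ns_families("3", "N", "3'UTR"))
```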

Identical sequence labeling. Sequence redundancy is
another problem of coronavirus sequences in public
nucleotide databases. Different strains of the same species
from samples collected in different locations or at different
times may possess completely or partially identical
sequences. These sequences, though containing important
epidemiological information, increase the workload
during sequence analysis. In CoVDB, we compared all
nucleotide sequences and labeled the identical ones to
mitigate this problem. Users can choose to show or not to
show strains with identical sequences by clicking on the
check boxes to the left of the page (Figure 3b).
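
Labeling of identical sequences can be sketched by grouping gene ids on the sequence string itself. The example mirrors the ‘Uniq’ column described in the Figure 3 legend, but it is a minimal sketch under stated assumptions: it detects only completely identical sequences (not partial identity), and the gene ids and sequences are hypothetical.

```python
from collections import defaultdict


def label_identical(seqs):
    """Map each gene id to 'Uniq' or to the ids of identical sequences.

    seqs is a dict of gene id -> nucleotide sequence. Comparison is
    case-insensitive and requires the full sequences to match.
    """
    groups = defaultdict(list)
    for gene_id, seq in seqs.items():
        groups[seq.upper()].append(gene_id)
    labels = {}
    for ids in groups.values():
        for gene_id in ids:
            others = [str(i) for i in ids if i != gene_id]
            labels[gene_id] = ",".join(others) if others else "Uniq"
    return labels


# Hypothetical gene ids and toy sequences.
labels = label_identical({1: "ATGC", 2: "atgc", 3: "ATGA"})
print(labels)
```

For a collection of several thousand sequences, grouping by a dictionary key in this way avoids comparing every pair of sequences directly.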

BLAST similarity search. During the analysis of coronavirus
gene sequences, we encountered a major problem
when coronavirus gene sequences, especially those of
orf1ab, were used for BLAST searches against GenBank or any
other coronavirus database. When part of the orf1ab
gene (e.g. nsp5) is used as the query sequence, instead
of returning the gene for the specific non-structural protein
that the query sequence is homologous to, the results
only show that the hits are within orf1ab or, in some
cases, within the entire coronavirus genome.
Much time is then needed to analyze the
results manually in order to locate the positions of the
cleavage sites of the corresponding genes for the
non-structural proteins, which makes further
downstream work very inefficient.
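
The manual step described above amounts to asking which annotated non-structural protein a hit interval falls in. Once the nsp boundaries are annotated, this becomes a simple interval-overlap lookup, as in the sketch below. The coordinates are invented for illustration and do not correspond to any real genome; the function name is hypothetical.

```python
def overlapping_features(features, hit_start, hit_end):
    """Return names of annotated features that overlap a hit interval.

    features is a list of (name, start, end) with 1-based inclusive
    coordinates on the same sequence as the hit.
    """
    return [
        name for name, start, end in features
        if start <= hit_end and hit_start <= end
    ]


# Hypothetical nsp boundaries within orf1ab.
annotation = [("nsp4", 100, 599), ("nsp5", 600, 1499), ("nsp6", 1500, 2399)]

print(overlapping_features(annotation, 700, 1200))
print(overlapping_features(annotation, 1400, 1600))
```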

This problem has been overcome by the annotated
sequences in CoVDB. The BLAST search page of CoVDB
is an interface for coronavirus similarity
searches. The underlying program, blastall, is
from the NCBI BLAST package. The BLAST search page
can be entered by clicking ‘Tools’ in the top menu bar on
any page of CoVDB. Since all sequences in CoVDB are
annotated, they can be grouped into different datasets
for BLAST searches. Users can choose one of the three
nucleotide and two protein sequence datasets as the
database for comparison (Figure 4). The three nucleotide
sequence datasets are: CoV genes (nsp + genes after 1ab),
CoV genes (1ab + genes after 1ab) and CoV GenBank
strains, which are the original sequences retrieved from
GenBank. The two protein sequence datasets are the
translated sequences of the first two nucleotide datasets:
CoV proteins (nsp + aa after 1ab) and CoV proteins
(1ab + aa after 1ab).

Figure 3. Screenshots of all gene retrieval pages. (a) Gene sequences are grouped vertically according to the coronavirus group and subgroup they
belong to, and horizontally by the names of the genes. The number next to each check box indicates the number of sequences of that gene in CoVDB. The option
‘Exclude partial CDS’ can be used if only complete genes are required. (b) Example showing the 15 sequences of nsp13 in group 3 coronaviruses.
The first column is the CoVDB gene id. In the Uniq column, ‘Uniq’ is shown if there is no other identical sequence in CoVDB. Otherwise, the gene ids
of the sequences identical to it are shown.

MyBlast. ‘MyBlast’ employs the same blast program 
as the Blast page mentioned above. However, instead of 
selecting a predefined nucleotide or amino acid sequence 
database, multiple sequences can be pasted into the second 
sequence input box to generate a temporary sequence 
database. One or more query sequences can be pasted 
into the first sequence input box for blastn or blastp search 
against the temporary sequence database. 

ORF finder for coronavirus. This ORF finder is
specifically designed for coronavirus genome analysis. The result
page shows the position and length of each putative
ORF and the position of the putative ribosomal
frameshift site for translation of orf1ab. The nucleotide
or amino acid sequences of the ORFs can be shown
by selecting the corresponding check boxes. To facilitate
genome comparison and annotation, the most closely
related coronavirus that has been annotated in
CoVDB can be chosen from a pull-down list for
comparison using a BLAST search. This function is
particularly useful for determining the boundaries of the
nsps in orf1ab.
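
The basic idea of an ORF finder can be sketched as a scan of the three forward reading frames for stretches that run from an ATG to the next in-frame stop codon. This is a generic minimal sketch, not the CoVDB tool: it ignores the reverse strand, does not model the ribosomal frameshift in orf1ab, and the function name, the minimum-length parameter and the toy sequence are assumptions.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def find_orfs(genome, min_len=90):
    """Return (start, end, frame) of forward-strand ORFs.

    Coordinates are 1-based and inclusive, and the end includes the
    stop codon. Each ORF starts at the first ATG after the previous
    in-frame stop codon.
    """
    seq = genome.upper().replace("U", "T")
    orfs = []
    for frame in range(3):
        start = None
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == "ATG":
                start = i
            elif start is not None and codon in STOP_CODONS:
                if i + 3 - start >= min_len:
                    orfs.append((start + 1, i + 3, frame + 1))
                start = None
    return sorted(orfs)


# Toy sequence with a single short ORF in the third frame.
print(find_orfs("CCATGAAATTTTAGGG", min_len=9))
```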


DISCUSSION 

Rapid and accurate batch sequence retrieval is both the
cornerstone and the bottleneck of comparative gene or
genome analysis. During the process of complete
genome sequencing and comparative analysis of the
various novel human and animal coronavirus genomes
in the past 2 years, we have developed a comprehensive
database, CoVDB, of annotated coronavirus genes
and genomes, which offers efficient batch sequence
retrieval and analysis. In our experience of
using CoVDB for comparative genome analysis of the
novel coronaviruses we have discovered (4,13,16,18,19),
CoVDB is more rapid and efficient than
other existing coronavirus databases for batch sequence
retrieval, for the following reasons. First, we have
annotated all non-structural proteins in
the polyprotein encoded by orf1ab of every
sequence. Second, the non-structural proteins encoded
by ORFs downstream of orf1ab were annotated
using a standardized system, with
exceptions for some names that have been used for
a long time, so as to minimize confusion. Third, all
identical nucleotide sequences were labeled,
so that one can choose to show or not to show strains
with identical sequences. Fourth, CoVDB contains not
only complete coronavirus genome sequences, but also
incomplete genomes and their genes. Some genes of
coronaviruses, such as pol, spike and nucleocapsid, are
sequenced much more frequently than others because they
are either the most conserved or the least conserved. These gene
sequences are particularly important for evolutionary
analysis, single nucleotide polymorphism studies and the
design of primers for RT-PCR or quantitative RT-PCR
amplification.

Figure 3. Continued. (b) Result page for nsp13 of group 3 coronaviruses. Notes shown on the page:

- NA: not available.
- Shift: -1 frameshift position.
- Type: c, complete CDS; p, partial CDS.
- Length: a red number means this sequence may contain multiple stop codons due to sequencing error or mutation.
- Country:Region: if no virus location is available, this indicates the submitters' country and region.

Gene id  ntacc     Gene   From   To     Shift  Length  Type  Protein_id  Short  Strain/Isolate  Country:Region  Uniq  PMID
5307     AY319651  nsp13  15163  16965  0      1803    c                 IBV    BJ              China:Beijing   Uniq
5332     AY514435  [remaining fields and rows are not recoverable from the source]

Table 1. Genome organization of different groups of coronavirus

Group  Organization
1      5'UTR-nsp1-16-S-NS3x-E-M-N-(NS7x)-3'UTR
2a     5'UTR-nsp1-16-(NS2a)-HE-S-(NS4x)-NS5a-E-M-N-3'UTR
2b     5'UTR-nsp1-16-S-sars3x-E-M-sars6-sars7x-sars8x-N-3'UTR
2c     5'UTR-nsp1-16-S-NS3x-E-M-N-3'UTR
2d     5'UTR-nsp1-16-S-NS3x-E-M-N-(NS7x)-3'UTR
3      5'UTR-nsp1-16-S-NS3x-E-M-NS5x-N-(NS7x)-3'UTR

Availability 

CoVDB was constructed by the Department of
Microbiology, The University of Hong Kong. It is available
at no charge at http://covdb.microbiology.hku.hk.